Anomaly Detection

This invention relates to anomaly detection, and to a method, an apparatus and
computer software for implementing it. More particularly, although not exclusively, it
relates to detection of fraud in areas such as telecommunications and retail sales by
searching for anomalies in data.

It is known to detect fraud with the aid of fraud management systems which use
hand-crafted rules to characterise fraudulent behaviour. The rules are generated by human
experts in fraud, who supply and update them for use in fraud management systems in
detecting fraud. The need for human experts to generate rules is undesirable because it
is onerous, particularly if the number of possible rules is large or changing at a significant
rate.

It is also known to avoid the need for human experts to generate rules: i.e. artificial
neural networks are known which learn to characterise fraud automatically by processing
training data. They then detect fraud indicated in other data from characteristics so
learned. However, neural networks characterise fraud in a way that is not immediately
visible to a user and does not readily translate into recognisable rules. It is important to
be able to characterise fraud in terms of breaking of acceptable rules, so this aspect of
neural networks is a disadvantage.

Known rule-based fraud management systems can detect well-known types of fraud
because experts know how to construct appropriate rules. In particular, fraud over
circuit-switching networks is well understood and can be dealt with in this way. However,
telecommunications technology has changed in recent years with circuit-switching
networks being replaced by Internet protocol packet-switching networks, which can
transmit voice and Internet protocol data over telecommunications systems. Fraud
associated with Internet protocol packet-switching networks is more complex than that
associated with circuit-switching networks: this is because in the Internet case, fraud can
manifest itself at a number of points on a network, and experts are still learning about the
potential for new types of fraud. Characterising complex types of fraud manually from
huge volumes of data is a major task. As telecommunications traffic across
packet-switching networks increases, it becomes progressively more difficult to
characterise and detect fraud.

The present invention provides a method of anomaly detection characterised in that it
incorporates the steps of:-

a) developing a rule set of at least one anomaly characterisation rule from a training
data set and any available relevant background knowledge, a rule covering a
proportion of positive anomaly examples of data in the training data set, and

b) applying the rule set to test data for anomaly detection therein.

The method of the invention provides the advantage that it obtains rules from data, not 
human experts, and the rules are not invisible to a user. 

Data samples in the training data set may have characters indicating whether or not they
are associated with anomalies. The invention may be a method of detecting
telecommunications or retail fraud from anomalous data and may employ inductive logic
programming to develop the rule set.

Each rule may have a form that an anomaly is detected or otherwise by application of the
rule according to whether or not a condition set of at least one condition associated with
the rule is fulfilled. A rule may be developed by refining a most general rule by at least
one of:

a) addition of a new condition to the condition set; 

b) unification of different variables to become constants or structured terms in condition. 

A variable in a rule which is defined as being in constant mode and is numerical is at
least partly evaluated by providing a range of values for the variable, estimating an
accuracy for each value and selecting a value having optimum accuracy. The range of
values may be a first range with values which are relatively widely spaced, a single
optimum accuracy value being obtained for the variable, and the method including
selecting a second and relatively narrowly spaced range of values in the optimum
accuracy value's vicinity, estimating an accuracy for each value in the second range and
selecting a value in the second range having optimum accuracy.

The method may include filtering to remove duplicates of rules and equivalents of rules,
i.e. rules having like but differently ordered conditions compared to another rule, and
rules which have conditions which are symmetric compared to those of another rule. It
may include filtering to remove unnecessary "less than or equal to" ("lteq") conditions.
Unnecessary "lteq" conditions may be associated with at least one of ends of intervals,
multiple lteq predicates and equality condition and lteq duplication.

The method may include implementing an encoding length restriction to avoid overfitting
noisy data by rejecting a rule refinement if the refinement encoding cost in number of bits
exceeds a cost of encoding the positive examples covered by the refinement.

Rule construction may stop if at least one of three stopping criteria is fulfilled as follows: 

a) the number of conditions in any rule in a beam of rules being processed is greater 
than or equal to a prearranged maximum rule length, 

b) no negative examples are covered by a most significant rule, which is a rule that: 

i) is present in a beam currently being or having been processed, 

ii) is significant, 

iii) has obtained a highest likelihood ratio statistic value found so far, and

iv) has obtained an accuracy value greater than a most general rule accuracy 
value, and 

c) no refinements were produced which were eligible to enter the beam currently 
being processed in a most recent refinement processing step (32). 

A most significant rule may be added to a list of derived rules and positive examples
covered by the most significant rule may be removed from the training data set.

The method may include:

a) selecting rules which have not met rule construction stopping criteria,

b) selecting a subset of refinements of the selected rules associated with accuracy
estimate scores higher than those of other refinements of the selected rules, and 

c) iterating a rule refinement, filtering and evaluation procedure (32 to 38) to identify any 
refined rule usable to test data. 

In another aspect, the present invention provides computer apparatus for anomaly
detection characterised in that it is programmed to execute the steps of:-

a) developing a rule set of at least one anomaly characterisation rule from a training
data set and any available relevant background knowledge, a rule covering a
proportion of positive anomaly examples of data in the training data set, and

b) applying the rule set to test data for anomaly detection therein.

In a further aspect, the present invention provides computer software for use in anomaly
detection characterised in that it incorporates instructions for controlling computer
apparatus to execute the steps of:-

a) developing a rule set of at least one anomaly characterisation rule from a training
data set and any available relevant background knowledge, a rule covering a 
proportion of positive anomaly examples of data in the training data set, and 

b) applying the rule set to test data for anomaly detection therein. 

The computer apparatus and computer software aspects of the invention may have
preferred features equivalent mutatis mutandis to those of the method aspect.

In order that the invention might be more fully understood, an embodiment thereof will
now be described, by way of example only, with reference to the accompanying
drawings, in which:-

Figure 1 is a flow diagram illustrating a procedure for characterisation of fraudulent
transactions in accordance with the invention; and

Figure 2 is another flow diagram illustrating generation of a rule set for use in
characterisation of fraudulent transactions in the Figure 1 procedure.

One example of an application of anomaly detection using the invention concerns
characterisation of retail fraud committed in shops by cashiers. The invention in this
example may be used in conjunction with current commercial systems that can measure
and record the amount of money put into and taken out of cashiers' tills. Various kinds of
cashier behaviour may indicate fraudulent or suspicious activity.

In this example of the invention transactions from a number of different cashiers' tills
were employed. Each transaction was described by a number of attributes including
cashier identity, date and time of transaction, transaction type (e.g. cash or non-cash)
and an expected and an actual amount of cash in a till before and after a transaction.
Each transaction is labelled with a single Boolean attribute which indicates "true" if the
transaction is known or suspected to be fraudulent and "false" otherwise. Without
access to retail fraud experts, definitions of background knowledge were generated in
the form of concepts or functions relating to data attributes. One such function calculated
a number of transactions handled by a specified cashier and having a discrepancy: here
a discrepancy is a difference in value between actual and expected amounts of cash in
the till before and after a single transaction.
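
As an illustration only, a background-knowledge function of this kind might be sketched in Python as follows. The record layout and the names `Transaction`, `has_discrepancy` and `count_discrepancies` are assumptions made for this sketch and do not appear in the embodiment, which is implemented in Prolog. The sketch also assumes that a transaction counts as having a discrepancy when actual and expected cash differ either before or after it.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Transaction:
    """Hypothetical record holding the till attributes described above."""
    cashier_id: str
    expected_before: float
    actual_before: float
    expected_after: float
    actual_after: float


def has_discrepancy(t: Transaction, tolerance: float = 0.0) -> bool:
    """True if actual and expected cash differ before or after the transaction."""
    return (abs(t.actual_before - t.expected_before) > tolerance
            or abs(t.actual_after - t.expected_after) > tolerance)


def count_discrepancies(transactions, cashier_id: str) -> int:
    """Number of transactions handled by `cashier_id` that show a discrepancy."""
    return sum(1 for t in transactions
               if t.cashier_id == cashier_id and has_discrepancy(t))


sample = [
    Transaction("c1", 100.0, 100.0, 120.0, 120.0),  # amounts agree
    Transaction("c1", 120.0, 120.0, 150.0, 140.0),  # till is short afterwards
    Transaction("c2", 50.0, 45.0, 60.0, 55.0),      # a different cashier
]
```

A rule condition could then test the returned count against a constant, for example whether a cashier has more than some number of discrepant transactions.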



In this example, the process of the invention derives rules from a training data set and
the definitions of basic concepts or functions associated with data attributes previously
mentioned. It evaluates the rules using a test data set and prunes them if necessary. The
rules so derived may be sent to an expert for verification or loaded directly into a fraud
management system for use in fraud detection. To detect fraud, the fraud management
system reads data defining new events and transactions to determine whether they are
described by the derived rules or not. When an event or transaction is described by a
rule then an alert may be given or a report produced to explain why the event was
flagged up as potentially fraudulent. The fraud management system will be specific to a
fraud application.

Benefits of applying the invention to characterisation of telecommunications and retail
fraud comprise:

* Characterisations in the form of rule sets may be learnt automatically (rather
than manually as in the prior art) from training data and any available
background knowledge or rules contributed by experts - this reduces costs
and duration of the characterisation process;

* Rule sets which are generated by this process are human readable and are
readily assessable by human experts prior to deployment within a fraud
management system; and

* the process may employ relational data, which is common in particular
applications of the invention - consequently facts and transactions which are
in different locations and which are associated can be linked together.

The process of the invention employs inductive logic programming software implemented
in a logic programming language called Prolog. This process has an objective of creating
a set of rules that characterises a particular concept, the set often being called a concept
description. A target concept description in this example is a characterisation of
fraudulent behaviour to enable prediction of whether an event or transaction is fraudulent
or not. The set of rules should be applicable to a new, previously unseen and unlabelled
transaction and be capable of indicating accurately whether it is fraudulent or not.

A concept is described by data which in this example is a database of events or
transactions that have individual labels indicating whether they are fraudulent or
non-fraudulent. A label is a Boolean value, 1 or 0, indicating whether a particular event or
transaction is fraudulent (1) or not (0). Transactions labelled as fraudulent are referred
to as positive examples of the target concept and those labelled as non-fraudulent are
referred to as negative examples of the target concept.

In addition to receiving labelled event/transactional data, the inductive logic programming
software may receive input of further information, i.e. concepts, facts of interest or
functions that can be used to calculate values of interest, e.g. facts about customers and
their accounts and a function that can be used to calculate an average monthly bill of a
given customer. As previously mentioned, this further information is known as
background knowledge, and is normally obtained from an expert in the relevant type of
fraud.

As a precursor to generating a rule set, i.e. before learning takes place, the labelled
event/transactional data is randomly distributed into two non-overlapping subsets - a
training data set and a test data set. Here non-overlapping means no data item is
common to both subsets. A characterisation or set of rules is generated using the
training data set. The set of rules is then evaluated on the test data set by comparing
the actual fraudulent or otherwise label of each event/transaction with the equivalent
predicted for it by the inductive logic programming software. This gives a value for
prediction accuracy - the percentage of correctly assessed transactions in the test data
set. Testing on a different data set of hitherto unseen examples, i.e. a set other than the
training data set, is a good indicator of the validity of the rule set.
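
The random partition and the prediction accuracy measure described above can be sketched as follows. This is a minimal Python illustration under stated assumptions: each example is a `(features, label)` pair, each rule is a callable returning True when it covers an example, an example is predicted fraudulent when at least one rule covers it, and the names `split_examples`, `prediction_accuracy` and the 30% test fraction are hypothetical choices, not values taken from the embodiment.

```python
import random


def split_examples(examples, test_fraction=0.3, seed=0):
    """Randomly partition labelled examples into disjoint training and test sets."""
    shuffled = list(examples)
    random.Random(seed).shuffle(shuffled)
    n_test = int(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]


def prediction_accuracy(rule_set, test_set):
    """Percentage of test examples whose predicted label equals the actual label."""
    if not test_set:
        raise ValueError("test set is empty")
    correct = sum(
        1 for features, label in test_set
        if any(rule(features) for rule in rule_set) == label
    )
    return 100.0 * correct / len(test_set)


# Toy data: a transaction is labelled fraudulent when its amount exceeds 50.
examples = [({"amount": a}, a > 50) for a in range(0, 100, 10)]
train, test = split_examples(examples)
rules = [lambda f: f["amount"] > 50]
accuracy = prediction_accuracy(rules, test)
```

Because every item is placed in exactly one of the two returned lists, no data item is common to both subsets.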

The target concept description is a set of rules in which each rule covers or characterises
a proportion of the positive (fraudulent) examples of data but none of the negative
(non-fraudulent) examples. It is obtained by repeatedly generating individual rules. When a
rule is generated, positive examples which it covers are removed from the training data
set. The process then iterates by generating successive rules using unremoved positive
examples, i.e. those still remaining in the training data set. After each iteration, positive
examples covered by the rule most recently generated are removed. The process
continues until there are too few positive examples remaining to allow another rule to be
generated. This is known as the sequential covering approach, and is published in
Machine Learning, T. Mitchell, McGraw-Hill, 1997.
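
The sequential covering loop described above can be sketched in Python as follows. This is an outline under assumptions, not the ILP engine itself: `learn_one_rule` stands in for the whole single-rule search and is supplied by the caller, `min_positives` is a hypothetical threshold for "too few positive examples", and `toy_learner` exists only to make the sketch runnable.

```python
def sequential_covering(positives, negatives, learn_one_rule, min_positives=1):
    """Build a rule set one rule at a time, removing covered positives each time.

    `learn_one_rule(positives, negatives)` returns a predicate over examples,
    or None if no acceptable rule can be found.
    """
    remaining = list(positives)
    rule_set = []
    while len(remaining) >= min_positives:
        rule = learn_one_rule(remaining, negatives)
        if rule is None:
            break
        covered = [e for e in remaining if rule(e)]
        if not covered:  # guard against a rule that makes no progress
            break
        rule_set.append(rule)
        remaining = [e for e in remaining if not rule(e)]
    return rule_set, remaining


def toy_learner(positives, negatives):
    """Toy search: cover the first remaining positive unless it is also a negative."""
    target = positives[0]
    if target in negatives:
        return None
    return lambda x, t=target: x == t


rules, uncovered = sequential_covering([1, 1, 2, 3], [0, 3], toy_learner)
```

In this toy run two rules are learnt (covering the values 1 and 2) and the value 3 is left uncovered because it also occurs among the negative examples.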

Referring to Figure 1, a process 10 involving applying the inductive logic programming
software (referred to as an ILP engine) at 12 to characterising fraudulent transactions is
as follows. Background knowledge 14 and a training data set 16 are input for processing
at 12 by the ILP engine, which processes them to produce a set of rules 18. Rule set
performance is evaluated at 20 using a test data set 22.

Referring now also to Figure 2, processing 12 to generate a set of rules is shown in more
detail. Individual rules have a form as follows:

IF {set of conditions} THEN {behaviour is fraudulent} (1)

A search for each individual rule begins at 30 with a most general rule (a rule with no 
conditions): searching is iterative (as will be described later) and generates a succession 
of rules, each new rule search beginning at 30. The most general rule is: 

IF { } THEN target_predicate is true (2)

This most general rule is satisfied by all examples, both positive and negative, because it
means that all transactions and facts are fraudulent. It undergoes a process of
refinement to make it more useful. There are two ways of producing a refinement to a
rule as follows:

* addition of a new condition to the IF{ } part of the rule;

* unification of different variables to become constants or structured terms.

Addition of a new condition and unification of different variables are standard
expressions for refinement operator types though their implementation may differ
between systems. A condition typically corresponds to a test on some quantity of
interest, and tests are often implemented using corresponding functions in the
background knowledge. When a new condition is added to a rule, its variables are
unified with those in the rest of the rule according to user-specified mode declarations.
Unification of a variable X to a variable Y means that all occurrences of X in the rule will
be replaced by Y. A mode declaration for a predicate specifies the type of each variable
and its mode. A variable mode may be input, output, or a constant. Only variables of the
same type can be unified. Abiding by mode rules reduces the number of refinements
that may be derived from a single rule and thus reduces the space of possible concept
descriptions and speeds up the learning process. There may be more than one way of
unifying a number of variables in a rule, in which case there will be more than one
refinement of the rule.

For example, a variable X may refer to a list of items. X could be unified to a constant
value [ ] which represents an empty list or to [Y|Z] which represents a non-empty list with
a first element variable Y and the rest of the list is represented by another variable Z.
Instantiating X by such unification constrains its value. In the first case, X is a list with no
elements and in the second case it must be a non-empty list. Unification acts to refine
variables and rules that contain them.

Variables that are defined as being in constant mode must be instantiated by a constant
value. Variables of constant type can further be defined by the user as either
non-numerical or numerical constants.

If a constant is defined as non-numerical then a list of possible discrete values for the
constant must also be specified by a user in advance. For each possible value of the
constant, a new version of an associated refinement is created in which the value is
substituted in place of the corresponding variable. New refinements are evaluated using
an appropriate accuracy estimate and the refinement giving the best accuracy score is
recorded as the refinement of the original rule.

If a constant is specified as numerical, it can be further defined as either an integer or a
floating-point number. A method for calculating a best constant in accordance with the
invention applies to both integers and floating point numbers. If a constant is defined as
numerical then a continuous range of possible constant values must be specified by a
user in advance. For example, if the condition was "minutes_past_the_hour(X)" then X
could have a range 0-59.

In an integer constant search, if a range or interval length for a particular constant is less
than 50 in length, all integers (points) in the range are considered. For each of these
integers, a new version of a respective associated refinement is created in which the
relevant integer is substituted in place of a corresponding variable and new rules are
evaluated and given an accuracy score using an appropriate accuracy estimation
procedure. The constant(s) giving a best accuracy score is (are) recorded.

If the integer interval length is greater than 50, then a recursive process is carried out as
follows:

1. A proportion of the points (which are evenly spaced) in the interval length are
sampled to derive an initial set of constant values. For example, in the
"minutes_past_the_hour(X)" example, 10, 20, 30, 40 and 50 minutes might be
sampled. For each of these values, a new version of a respective refinement is
created in which the value is substituted in place of a corresponding variable and a
respective rule is evaluated for each value together with an associated accuracy
estimate.

2.a. If a single constant value provides the best score then a number of the values (the
number of which is a user selected parameter in the ILP engine 12) either side of this
value are sampled. For instance, if the condition minutes_past_the_hour(20) gave
the best accuracy then the following more precise conditions may then be evaluated:

* minutes_past_the_hour(15)
* minutes_past_the_hour(16)
* minutes_past_the_hour(17)
* minutes_past_the_hour(18)
* minutes_past_the_hour(19)
* minutes_past_the_hour(21)
* minutes_past_the_hour(22)
* minutes_past_the_hour(23)
* minutes_past_the_hour(24)
* minutes_past_the_hour(25)

If a single constant value in X = 15 to 25 gives the best accuracy score then that value is
chosen as a final value of the constant X.

2.b. If more than one constant value provides the best score then if they are
consecutive points in the sampling then the highest and lowest values are taken and
the values in their surrounding intervals are tested. For example, if
minutes_past_the_hour(20), minutes_past_the_hour(30) and
minutes_past_the_hour(40) all returned the same accuracy then the following points
would be tested for accuracy:

* minutes_past_the_hour(15)
* minutes_past_the_hour(16)
* minutes_past_the_hour(17)
* minutes_past_the_hour(18)
* minutes_past_the_hour(19)
* minutes_past_the_hour(41)
* minutes_past_the_hour(42)
* minutes_past_the_hour(43)
* minutes_past_the_hour(44)
* minutes_past_the_hour(45)

If the accuracy score decreases at an integer value N in the range 15 to 19 or 41 to 45,
then (N-1) is taken as the constant in the refinement of the relevant rule.

2.c. If a plurality of constant values provides the best accuracy score, and the values
are not consecutive sampled points then they are arranged into respective subsets of
consecutive points. The largest of these subsets is selected, and the procedure for a
list of consecutive points is followed as at 2.b above: e.g. if
minutes_past_the_hour(20), minutes_past_the_hour(30) and
minutes_past_the_hour(50) scored best then the subset minutes_past_the_hour(20)
- minutes_past_the_hour(30) would be chosen. If the largest interval consists of only
one value, then the procedure for a single returned value is followed as at 2.a above.
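
The integer constant search can be illustrated with the following Python sketch. It covers only the exhaustive case for short intervals and the single-best-value case (step 1 followed by step 2.a); the tie handling of steps 2.b and 2.c is not implemented, and ties are simply resolved in favour of the lowest tied value, which is a simplification. The function name, the `score` callback and the default parameter values are assumptions chosen to match the minutes-past-the-hour example.

```python
def best_integer_constant(score, low, high, coarse_step=10, neighbours=5,
                          exhaustive_below=50):
    """Return the integer in [low, high] with the best score found by the search.

    `score(value)` is a caller-supplied accuracy estimate for the refinement
    obtained by substituting `value` for the constant-mode variable.
    """
    if high - low < exhaustive_below:
        # Short interval: every integer point is evaluated.
        return max(range(low, high + 1), key=score)
    # Step 1: evaluate evenly spaced sample points across the interval.
    coarse = range(low + coarse_step, high, coarse_step)
    best_coarse = max(coarse, key=score)
    # Step 2.a: evaluate the points either side of the best coarse sample.
    fine_low = max(low, best_coarse - neighbours)
    fine_high = min(high, best_coarse + neighbours)
    return max(range(fine_low, fine_high + 1), key=score)


# Toy accuracy estimate that peaks at 17 minutes past the hour.
def toy_score(minute):
    return -abs(minute - 17)


best_minute = best_integer_constant(toy_score, 0, 59)
```

With the range 0-59 the coarse samples are 10, 20, 30, 40 and 50; the sample 20 scores best, so the points 15 to 25 are evaluated and 17 is selected.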

The user can opt to conduct a beam constant search: here a beam is an expression
describing generating a number of possible refinements to a rule and recording all of
them to enable a choice to be made between them later when subsequent refinements
have been generated. In this example, N refinements of a rule, each with a different
constant value, are recorded. This can be very effective, as the 'best' constant with
highest accuracy at one point in the refinement process 32 may not turn out to be the
'best' value over a series of repeated refinement iterations. This avoids the process 32
getting fixed in local non-optimum maxima.

Some variables in conditions/rules may be associated with multiple constants: if so each
constant associated with such a variable is treated as an individual constant, and a
respective best value for each is found separately as described above. An individual
constant value that obtains a highest accuracy score for the relevant rule is kept and the
corresponding variable is instantiated to that value. The remaining variables of constant
type are instantiated by following this process recursively until all constant type variables
have been instantiated (i.e. substituted by values).

Once all refinements of a rule have been found, in accordance with the invention, the
refinements are filtered at 34 to remove any rules that are duplicates or equivalents of
others in the set. Two rules are equivalent in that they express the same concept if their
conditions in the IF {set of conditions} part of the rule are the same but the conditions are
ordered differently. For example, IF {set of conditions} consisting of two conditions A
and B is equivalent to IF {set of conditions} with the same two conditions in a different
order, i.e. B and A. One of the two equivalent rules is removed from the list of
refinements and so is not considered further during rule refinement, which reduces the
processing burden.

Additionally, in accordance with the invention, symmetric conditions are not allowed in
any rule. For example, a condition equal(X,2), which means a variable X is equal in value
to 2, is symmetric to equal(2,X), i.e. 2 is equal in value to a variable X. One of the two
symmetric rules is removed from the list of refinements and so is not considered further.
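
One possible way to implement this filtering is to reduce each refinement to a canonical form and keep only the first refinement seen for each form. The Python sketch below assumes that a condition is a tuple of a predicate name followed by its arguments, that a rule is a tuple of conditions, and that the set `SYMMETRIC_PREDICATES` lists the predicates whose argument order is irrelevant; these representations and names are illustrative assumptions rather than the Prolog structures used in the embodiment.

```python
SYMMETRIC_PREDICATES = {"equal"}  # assumed: argument order does not matter


def canonical_condition(condition):
    """Put the arguments of a symmetric predicate into a fixed order."""
    name, *args = condition
    if name in SYMMETRIC_PREDICATES:
        args = sorted(args, key=str)
    return (name, *args)


def canonical_rule(conditions):
    """Order-independent key: the set of canonical conditions."""
    return frozenset(canonical_condition(c) for c in conditions)


def filter_refinements(refinements):
    """Keep the first refinement for each canonical form and drop the rest."""
    seen = set()
    kept = []
    for rule in refinements:
        key = canonical_rule(rule)
        if key not in seen:
            seen.add(key)
            kept.append(rule)
    return kept


refinements = [
    (("cash", "T"), ("equal", "X", 2)),
    (("equal", "X", 2), ("cash", "T")),  # same conditions, different order
    (("cash", "T"), ("equal", 2, "X")),  # symmetric form of equal(X, 2)
    (("cash", "T"), ("equal", "X", 3)),
]
kept = filter_refinements(refinements)
```

Here the second and third refinements are dropped as an equivalent and a symmetric variant of the first, leaving two refinements to carry forward.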

Pruning refinements to remove equivalent rules and symmetric conditions results in
fewer rules to consider at successive iterations of the refinement process 32, so the
whole rule generation process is speeded up. Such pruning can reduce rule search
space considerably, albeit the extent of this reduction depends on what application is
envisaged for the invention and how many possible conditions are symmetric; in this
connection where numerical variables are involved symmetric conditions are usually
numerous due to the use of 'equals' conditions such as equal(Y,X). For example, in the
retail fraud example, the rule search space can be cut by up to a third.

A 'less than or equals' condition, referred to as 'lteq', and an 'equals' condition are often
used as part of the background knowledge 14. They are very useful conditions for
comparing numerical variables within the data. For this reason, part of the filtering
process 34 ascertains that equals and lteq conditions in rules meet checking
requirements as follows:

* End of interval check: This checks the end of intervals where constant values are
involved: e.g. a condition lteq(A, 1000) means variable A is less than or equal to
1000: it is unnecessary if A has a user-defined range of between 0 and 1000, so
a refinement containing this condition is removed. In addition, lteq(1000, A), i.e. 1000
is less than or equal to A, should be equals(A, 1000) as A cannot be more than
1000. Therefore, refinements containing such conditions are rejected.

* Multiple 'lteq' predicate check: If two conditions lteq(A,X) and lteq(B,X), where A
and B are constants, are contained in the body of a rule, then one condition may
be removed depending on the values of A and B. For example, if lteq(30,X) and
lteq(40,X) both appear in a rule, then the condition lteq(30,X) is removed from the
rule as being redundant, because if 40 is less than or equal to X then so also is
30.

* Equals and lteq duplication check: in accordance with the invention if the body of
a rule contains both conditions lteq(C, Constant) and equals(C, Constant), then
only the equals condition is needed. Therefore, refinements containing lteq
conditions with associated equals conditions of this nature are rejected.
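
The three checks might be sketched in Python as follows. The sketch assumes that every condition is a triple `(name, a, b)`, that variables are strings and constants are numbers, and that `ranges` maps a variable name to its user-defined `(low, high)` interval. The function returns None for a rejected refinement and otherwise the conditions with redundant lteq conditions removed. All names are hypothetical, and only the upper end of an interval is checked because that is the case described above.

```python
def is_const(term):
    """Constants are numbers; variables are strings in this sketch."""
    return isinstance(term, (int, float))


def upper(ranges, var):
    """Upper end of a variable's user-defined range (unbounded if undeclared)."""
    return ranges.get(var, (float("-inf"), float("inf")))[1]


def check_lteq(conditions, ranges):
    """Apply the three lteq checks; return None if the refinement is rejected."""
    conds = list(conditions)
    equalities = {(a, b) for name, a, b in conds if name == "equals"}
    for name, a, b in conds:
        if name != "lteq":
            continue
        # End of interval check, e.g. lteq(A, 1000) with A ranging over 0-1000.
        if is_const(b) and not is_const(a) and b >= upper(ranges, a):
            return None
        # End of interval check, e.g. lteq(1000, A): should be equals(A, 1000).
        if is_const(a) and not is_const(b) and a >= upper(ranges, b):
            return None
        # Equals and lteq duplication check.
        if (a, b) in equalities:
            return None
    # Multiple lteq check: of lteq(c1, X) and lteq(c2, X) keep the larger constant.
    tightest = {}
    for name, a, b in conds:
        if name == "lteq" and is_const(a) and not is_const(b):
            tightest[b] = max(a, tightest.get(b, a))
    return [
        (name, a, b) for name, a, b in conds
        if not (name == "lteq" and is_const(a) and not is_const(b)
                and a < tightest[b])
    ]


ranges = {"A": (0, 1000), "X": (0, 59)}
simplified = check_lteq([("lteq", 30, "X"), ("lteq", 40, "X")], ranges)
```

In the example call, lteq(30,X) is dropped as redundant and only lteq(40,X) remains.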

Rule refinements are also filtered at 34 using a method called 'Encoding Length
Restriction' disclosed by N. Lavrac and S. Dzeroski, Inductive Logic Programming:
Techniques and Applications, Ellis Horwood, New York, 1994. It is based on a 'Minimum
Description Length' principle disclosed by B. Pfahringer, Practical Uses of the Minimum
Description Length Principle in Inductive Learning, PhD Thesis, Technical University of
Vienna, 1995.

Where training examples are noisy (i.e. contain incorrect or missing values), it is
desirable to ensure that rules generated using the invention do not overfit data by
treating noise present in the data as requiring fitting. Rule sets that overfit training data
may include some very specific rules that only cover a few training data samples. In
noisy domains, it is likely that these few samples will be noisy: noisy data samples are
unlikely to indicate transactions which are truly representative of fraud, and so rules
should not be derived to cover them.

The Encoding Length Restriction avoids overfitting noisy data by generating a rule
refinement only as long as the cost of encoding the refinement will not exceed the cost of
encoding the positive examples covered by the refinement, where 'cost' means number of
bits. A refinement is rejected if this cost criterion is not met. This prevents rules
becoming too specific, i.e. covering few but potentially noisy transactions.
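
A minimal Python sketch of this cost comparison is given below. The specification does not give the encoding formulae, so the sketch makes two assumptions: the cost in bits of the refinement itself is supplied by the caller as `refinement_bits`, and the cost of encoding the covered positive examples is taken as log2(N) + log2(C(N, p)) for N training examples and p covered positives, a form used in FOIL-style systems. The cited reference may define the costs differently, and the function names are hypothetical.

```python
from math import comb, log2


def explicit_bits(n_training, n_covered_pos):
    """Assumed cost in bits of listing the covered positive examples explicitly."""
    return log2(n_training) + log2(comb(n_training, n_covered_pos))


def accept_refinement(refinement_bits, n_training, n_covered_pos):
    """Accept only if encoding the refinement costs no more than its positives."""
    return refinement_bits <= explicit_bits(n_training, n_covered_pos)


# With 1024 training examples, one covered positive is worth 20 bits.
bits_for_one = explicit_bits(1024, 1)
```

Under these assumptions a refinement costing 25 bits that covers a single positive example out of 1024 would be rejected, whereas the same refinement covering 50 positives would be accepted.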

Once a rule is refined, the resulting refinements are evaluated in order to identify those
which are best. Rules are evaluated at 36 by estimating their classification accuracy.
This accuracy may be estimated using an expected classification accuracy estimate
technique disclosed by N. Lavrac and S. Dzeroski, Inductive Logic Programming:
Techniques and Applications, Ellis Horwood, New York, 1994, and by F. Zelezny and N.
Lavrac, An Analysis of Heuristic Rule Evaluation Measures, J. Stefan Institute Technical
Report, March 1999. Alternatively, it may be estimated using a weighted relative
accuracy estimate disclosed by N. Lavrac, P. Flach and B. Zupan, Rule Evaluation
Measures: A Unifying View, Proceedings of the 9th International Workshop on Inductive
Logic Programming (ILP-99), volume 1634 of Lecture Notes in Artificial Intelligence,
pages 174-185, Springer-Verlag, June 1999. A user may decide which estimating
technique is used to guide a rule search through a hypothesis space during rule
generation.
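
For illustration, the weighted relative accuracy of a rule is commonly written as the fraction of examples the rule covers multiplied by the gain in precision over the prior proportion of positives. The Python sketch below uses that standard form; the argument names (p and n for covered positives and negatives, P and N for all positives and negatives in the training data) are notation assumed for this sketch, and returning zero for a rule that covers nothing is a convention chosen here.

```python
def weighted_relative_accuracy(p, n, P, N):
    """Coverage times (precision of the rule minus prior proportion of positives)."""
    covered = p + n
    total = P + N
    if covered == 0 or total == 0:
        return 0.0  # convention for this sketch: an empty rule scores zero
    return (covered / total) * (p / covered - P / total)


# A rule covering all 20 positives and no negatives out of 100 examples.
perfect = weighted_relative_accuracy(20, 0, 20, 80)
# The most general rule covers everything and so gains nothing over the prior.
most_general = weighted_relative_accuracy(20, 80, 20, 80)
```

The most general rule scores zero, so any refinement with a positive score improves on it both in coverage-weighted terms and in precision.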

Once refinements have been evaluated in terms of accuracy, they are then tested for
what is referred to in the art of rule generation as 'significance'. In this example a
significance testing method is used which is based on a likelihood ratio statistic disclosed
in the N. Lavrac and S. Dzeroski reference above. A rule is defined as 'significant' if its
likelihood ratio statistic value is greater than a predefined threshold set by the user.

If a rule covers n positive examples and m negative examples, an optimum outcome of
refining the rule is that one of its refinements (an optimum refinement) will cover n
positive examples and no negative examples. A likelihood ratio for this optimum
refinement can be calculated. A rule is defined as 'possibly significant' if its optimum
refinement is significant. Note that it is possible that a rule may not actually be
significant, but it may be possibly significant in accordance with this definition.
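
The two definitions can be illustrated with the following Python sketch. It uses the two-class likelihood ratio statistic in its commonly stated form, twice the sum over classes of the observed count times the natural logarithm of the observed count over the expected count, where the expected count follows the class proportions of the whole training set. The argument names (`pos` and `neg` for covered examples, `P` and `N` for training-set totals), the function names and the threshold value are assumptions for this sketch; the cited reference should be consulted for the exact form used.

```python
from math import log


def likelihood_ratio(pos, neg, P, N):
    """Two-class likelihood ratio statistic for a rule covering pos and neg examples."""
    covered = pos + neg
    total = P + N
    stat = 0.0
    for observed, class_total in ((pos, P), (neg, N)):
        if observed > 0:  # a class with no covered examples contributes nothing
            expected = covered * class_total / total
            stat += observed * log(observed / expected)
    return 2.0 * stat


def is_significant(pos, neg, P, N, threshold):
    return likelihood_ratio(pos, neg, P, N) > threshold


def is_possibly_significant(pos, P, N, threshold):
    """Significance of the optimum refinement: same positives, no negatives."""
    return is_significant(pos, 0, P, N, threshold)


THRESHOLD = 6.64  # illustrative user-set threshold, not a value from the text

# A rule covering 10 positives and 10 negatives from a balanced set of 100.
balanced_rule = likelihood_ratio(10, 10, 50, 50)
```

The balanced rule has a statistic of zero and is therefore not significant, yet it is possibly significant because a refinement keeping its 10 positives and shedding all negatives would exceed the threshold.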



A first rule under consideration in the process 12 is checked at 38 to see whether or not
it meets rule construction stopping criteria: in this connection, the construction of an
individual rule terminates when any one or more of three stopping criteria is fulfilled as
follows:

1. the number of conditions in any rule in a beam (as defined earlier) currently being
processed is greater than or equal to a maximum rule length specified by the
user. If a most significant rule (see at 2. below) exists this is added to the
accumulating rule set at 40,

2. a most significant rule covers no negative examples - where the most significant
rule is defined as a rule that is either present in the current beam, or was present
in a previous beam, and this rule:

a) is significant,

b) obtained the highest likelihood ratio statistic value found so far, and

c) obtained an accuracy value greater than the accuracy value of the most
general rule (that covers all examples, both positive and negative), and

3. the previous refinement step 32 produced no refinements eligible to enter the
new beam; if a most significant rule exists it is added to the accumulating rule set
at 40.

Note that a most significant rule may not necessarily exist; if it does not, no significant
refinements have been found so far. If it is the case that a most significant rule does not
exist but the stopping criteria at 38 are satisfied, then no rule is added to the rule set at 40
and the stopping criteria at 44 will be satisfied (as will be described later).
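
The three rule construction stopping criteria can be sketched in Python as follows. This
is a minimal illustration, not the embodiment's implementation: the `Rule` class, its
fields and both function names are hypothetical, and the identification of the most
significant rule (significance, highest likelihood ratio, accuracy above that of the most
general rule) is assumed to have been done elsewhere.

```python
from dataclasses import dataclass
from typing import List, Optional, Sequence


@dataclass
class Rule:
    conditions: List[str]    # condition set of the rule
    negatives_covered: int   # negative training examples the rule covers


def rule_construction_stops(
    beam: Sequence[Rule],
    most_significant: Optional[Rule],
    eligible_refinements: Sequence[Rule],
    max_rule_length: int,
) -> bool:
    """Return True if any one of the three stopping criteria is fulfilled."""
    # Criterion 1: some rule in the current beam has reached the maximum length.
    if any(len(rule.conditions) >= max_rule_length for rule in beam):
        return True
    # Criterion 2: a most significant rule exists and covers no negative examples.
    if most_significant is not None and most_significant.negatives_covered == 0:
        return True
    # Criterion 3: the last refinement step produced nothing eligible for the new beam.
    return len(eligible_refinements) == 0


def rule_to_add(stops: bool, most_significant: Optional[Rule]) -> Optional[Rule]:
    """Rule added to the accumulating rule set, or None if there is nothing to add."""
    return most_significant if stops else None
```

When construction stops and no most significant rule exists, `rule_to_add` returns
`None`, which corresponds to the case in which no rule is added to the rule set.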

When a rule is added at 40, the positive examples it covers are removed from the
training data at 42, and remaining or unremoved positive and negative examples form a
modified training data set for a subsequent iteration (if any) of the rule search.

At 44 a check is made to see whether or not the accumulating rule set satisfies stopping
criteria. In this connection, accumulation of the rule set terminates at 46 (finalising the
rule set) when either of the following criteria is fulfilled, that is to say when either:

* construction of a rule is terminated because a most significant rule does not exist,
or

* too few positive examples remain for further rules to be significant.

If at 44 the accumulating rule set does not satisfy the rule set stopping criteria, another
most general rule is selected at 30 and accumulation of the rule set iterates through
stages 32 etc. At any given time in operation of the rule generation process 12, there are
a number (zero or more) of rules for which processing has terminated and which have been
added to the accumulating rule set, and there are (one or more) evolving rules or
proto-rules for which processing to yield refinements continues iteratively.
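
The accumulation of the rule set (stages 40 to 46) follows a covering pattern that can be
sketched in Python as follows. This is a hedged illustration: `search_rule` stands in for
the whole rule construction search and is hypothetical, a rule is modelled simply as a
function returning True for the examples it covers, and the test for "too few positive
examples" is approximated by a hypothetical `min_positives` parameter.

```python
from typing import Any, Callable, Iterable, List, Optional, Set

Rule = Callable[[Any], bool]  # a rule returns True for each example it covers


def accumulate_rule_set(
    positives: Iterable[Any],
    negatives: Iterable[Any],
    search_rule: Callable[[Set[Any], Set[Any]], Optional[Rule]],
    min_positives: int = 1,
) -> List[Rule]:
    """Add rules one at a time, removing the positive examples each one covers."""
    remaining = set(positives)
    negative_set = set(negatives)
    rule_set: List[Rule] = []
    # Stop when too few positive examples remain for another rule to be significant.
    while len(remaining) >= min_positives:
        rule = search_rule(remaining, negative_set)
        if rule is None:
            break  # no most significant rule exists, so the rule set is finalised
        covered = {example for example in remaining if rule(example)}
        if not covered:
            break  # safety guard added for this sketch, not taken from the description
        rule_set.append(rule)
        remaining -= covered  # negatives are kept for the next iteration
    return rule_set
```

Only positive examples are removed after each rule is added; the negative examples stay
in the modified training data set, as in the description above.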



If evolving rules are checked at 38 and are found not to meet any of the rule construction
stopping criteria previously mentioned, those refinements of such rules are chosen which
have the best accuracy estimate scores. The chosen refinements then provide a basis
for a next generation of rules to be refined further in subsequent refinement iterations.
The user defines the number of refinements forming a new beam to be taken to a further
iteration by fixing a parameter called 'beam_width'. As has been said, a beam is a
number of recorded possible refinements to a rule from which a choice will be made
later, and beam_width is the number of refinements in it. For a beam width N, the
refinements having the best N accuracy estimate scores are found and taken forward at
48 as part of the new beam to the next iteration. The sequence of stages 32 to 38 then
iterates for this new beam via a loop 50.

Each refinement entering the new beam must: 
* be possibly significant (but not necessarily significant), and

* improve upon or equal the accuracy of its parent rule (the rule from which it was 
derived by refinement previously). 
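
Selection of the new beam can be sketched in Python as follows. This is an illustration
under stated assumptions: the `Refinement` class and its fields are hypothetical, and the
accuracy estimates and the 'possibly significant' flag are taken to have been computed
beforehand.

```python
from dataclasses import dataclass
from typing import List, Sequence


@dataclass
class Refinement:
    name: str
    accuracy: float             # accuracy estimate score of this refinement
    parent_accuracy: float      # accuracy estimate score of the rule it was refined from
    possibly_significant: bool  # result of the 'possibly significant' test


def select_new_beam(refinements: Sequence[Refinement], beam_width: int) -> List[Refinement]:
    """Keep eligible refinements, then take the beam_width best by accuracy."""
    eligible = [
        r for r in refinements
        if r.possibly_significant and r.accuracy >= r.parent_accuracy
    ]
    ranked = sorted(eligible, key=lambda r: r.accuracy, reverse=True)
    return ranked[:beam_width]
```

If fewer than `beam_width` refinements are eligible the beam is simply shorter, and an
empty result corresponds to the third rule construction stopping criterion.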

If required by the user, the accumulated rule set can be post-pruned using a reduced
error pruning method disclosed by J. Fürnkranz, A Comparison of Pruning Methods for
Relational Concept Learning, Proceedings of AAAI '94 Workshop on Knowledge
Discovery in Databases (KDD-94), Seattle, WA, 1994. In this case, another set of
examples should be provided - a pruning set of examples.

Examples of a small training data set, background knowledge and a rule set generated
therefrom will now be given. In practice there may be very large numbers of data samples
in a data set.

Training data

The training data is a transaction database, represented as Prolog facts in a format
as follows:

trans(Trans ID, Date, Time, Cashier, Expected amount in till, Actual amount in
till, Suspicious Flag). Here 'trans' and 'Trans' mean transaction and ID means
identity.

A sample of an example set of transaction data is shown below. Transactions
with Suspicious Flag = 1 are fraudulent (positive examples), and with
Suspicious Flag = 0 are not (negative examples). The individual Prolog facts
were:

trans(1,30/08/2003,09:02,cashier_1,121.87,123.96, 0).
trans(2,30/08/2003,08:55,cashier_1,119.38,121.82, 0).
trans(3,30/08/2003,08:50,cashier_1,118.59,119.38, 0).
trans(4,30/08/2003,08:48,cashier_1,116.50,118.59, 0).
trans(5,30/08/2003,08:44,cashier_1,115.71,116.50, 0).
trans(6,30/08/2003,22:40,cashier_2,431.68,435.17, 0).
trans(7,30/08/2003,22:37,cashier_2,423.70,431.68, 1).
trans(8,30/08/2003,22:35,cashier_2,420.01,423.70, 0).

Background knowledge:

Examples of appropriate background knowledge concepts, represented using
Prolog, are:

discrepancy(Trans_ID, Discrepancy).

This gives the discrepancy in UK £ and pence between the expected amount of
cash in a till and the actual amount of cash in that till for a particular transaction,
e.g.:

discrepancy(1, 2.09).
discrepancy(2, 2.44).
discrepancy(7, 7.98).

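The discrepancy values above agree with the sample transactions if the discrepancy is
taken as the actual amount in the till minus the expected amount. The following Python
sketch shows that derivation; the sign convention is an assumption inferred from the
three sample values, and the tuple layout and names are hypothetical.

```python
from decimal import Decimal

# (trans_id, cashier, expected_in_till, actual_in_till, suspicious_flag)
# Amounts are kept as strings so that Decimal represents the pence exactly.
TRANSACTIONS = [
    (1, "cashier_1", "121.87", "123.96", 0),
    (2, "cashier_1", "119.38", "121.82", 0),
    (7, "cashier_2", "423.70", "431.68", 1),
]


def discrepancy(expected: str, actual: str) -> Decimal:
    """Assumed convention: actual amount in till minus expected amount in till."""
    return Decimal(actual) - Decimal(expected)


# Maps each transaction ID to its discrepancy, mirroring discrepancy(Trans_ID, Discrepancy).
DISCREPANCIES = {
    trans_id: discrepancy(expected, actual)
    for trans_id, _cashier, expected, actual, _flag in TRANSACTIONS
}
```
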
total_trans(Cashier, Total number of transactions, Month and Year).
This gives the total number of transactions made by the cashier in a given
month and year, e.g.:

total_trans(cashier_1, 455, 08/2003).
total_trans(cashier_2, 345, 08/2003).

number_of_trans_with_discrepancy(Cashier, Number, Month and Year).
This gives the total number of transactions with a discrepancy made by a
cashier in a given month and year, e.g.:

number_of_trans_with_discrepancy(cashier_1, 38, 08/2003).
number_of_trans_with_discrepancy(cashier_2, 93, 08/2003).

number_of_trans_with_discrepancy_greater_than(Cashier, Number, Bound,
Month and Year).

This gives the total number of transactions with a discrepancy greater than some
bound made by a cashier in a given month and year, e.g.:

number_of_trans_with_discrepancy_greater_than(cashier_1,5,100,08/2003).
number_of_trans_with_discrepancy_greater_than([illegible],3,150,08/2003).
number_of_trans_with_discrepancy_greater_than(cashier_2,15,100,08/2003).
number_of_trans_with_discrepancy_greater_than([illegible]).


Generated rule set:

The target concept is fraudulent(Cashier). The rule set characterises a cashier who
has made fraudulent transactions.

fraudulent(Cashier) :-
    number_of_trans_with_discrepancy_greater_than(Cashier, Discrepancies,
        100, Month),
    Discrepancies ≥ 10.
fraudulent(Cashier) :-
    total_trans(Cashier, Total_Trans, Month),
    Total_Trans ≥ 455,
    number_of_trans_with_discrepancy(Cashier, Discrepancies, Month),
    Discrepancies ≥ 230.

This example of a generated rule set characterises fraudulent cashiers using two rules.
The first rule indicates that a cashier is fraudulent if, in a single month, the cashier
has performed at least 10 transactions with a discrepancy greater than 100.






The second rule describes a cashier as fraudulent if, in a single month, the cashier has
carried out at least 455 transactions, where at least 230 of these have had a discrepancy
between the expected amount and the actual transaction amount.
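
Applying the two rules to the example background knowledge can be sketched in Python as
follows. This is an illustration only: the thresholds are read as "at least", following
the prose description of the two rules, and the dictionary layout and function names are
hypothetical stand-ins for the Prolog facts.

```python
# Background knowledge for 08/2003, keyed by (cashier, month).
TOTAL_TRANS = {("cashier_1", "08/2003"): 455, ("cashier_2", "08/2003"): 345}
WITH_DISCREPANCY = {("cashier_1", "08/2003"): 38, ("cashier_2", "08/2003"): 93}
# Keyed by (cashier, bound, month): transactions with a discrepancy above the bound.
DISCREPANCY_GREATER_THAN = {
    ("cashier_1", 100, "08/2003"): 5,
    ("cashier_2", 100, "08/2003"): 15,
}


def rule_1(cashier: str, month: str) -> bool:
    """At least 10 transactions in the month with a discrepancy greater than 100."""
    return DISCREPANCY_GREATER_THAN.get((cashier, 100, month), 0) >= 10


def rule_2(cashier: str, month: str) -> bool:
    """At least 455 transactions in the month, at least 230 with a discrepancy."""
    return (
        TOTAL_TRANS.get((cashier, month), 0) >= 455
        and WITH_DISCREPANCY.get((cashier, month), 0) >= 230
    )


def fraudulent(cashier: str, month: str) -> bool:
    """A cashier is flagged if any rule in the rule set covers them."""
    return rule_1(cashier, month) or rule_2(cashier, month)


flagged = [c for c in ("cashier_1", "cashier_2") if fraudulent(c, "08/2003")]
```

With the example facts only cashier_2 is flagged, by the first rule: cashier_1 has too
few large discrepancies for the first rule and too few discrepancies overall for the
second.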

The embodiment of the invention described above provides the following benefits: 

* It is fast because it prunes out duplicate rules avoiding unnecessary processing;

* It can deal with and tune numerical and non-numerical constants to derive rules
that bound variables (e.g. IF transaction value is between £19.45 and £67.89
THEN ...);

* It can make use of many different heuristics (decision techniques e.g. based on
scores for accuracy), which can be changed and turned on or off by a user;

* It uses a weighted relative accuracy measure in rule generation;

* It develops rules that are readable and its reasoning can be understood (unlike
a neural network for example);

* It can be tuned to a particular application by adjusting its parameters and
changing/adding heuristics;

* It can use relational and structural data that can be expressed in Prolog; 

* It can process numerical and non-numerical data; and 

* It can make use of expert knowledge encoded in Prolog. 

The process undertaken by the ILP engine at 12 as set out in the foregoing description
can clearly be evaluated by an appropriate computer program comprising program
instructions embodied in an appropriate carrier medium and running on a conventional
computer system. The computer program may be embodied in a memory, a floppy or
compact or optical disc or other hardware recordal medium, or an electrical signal. Such
a program is straightforward for a skilled programmer to implement on the basis of the
foregoing description without requiring invention, because it involves well known
computational procedures.

Claims

1. A method of anomaly detection characterised in that it incorporates the steps of:-

a) developing a rule set of at least one anomaly characterisation rule from a
training data set and any available relevant background knowledge, a rule
covering a proportion of positive anomaly examples of data in the training data 
set, and 

b) applying the rule set to test data for anomaly detection therein. 

2. A method according to Claim 1 characterised in that data samples in the training data
set have characters indicating whether or not they are associated with anomalies.

3. A method according to Claim 2 characterised in that it is a method of detecting
telecommunications or retail fraud from anomalous data.

4. A method according to Claim 3 characterised in that it employs inductive logic
programming to develop the rule set.

5. A method according to Claim 4 characterised in that each rule has a form that an
anomaly is detected or otherwise by application of the rule according to whether or not
a condition set of at least one condition associated with the rule is fulfilled.

6. A method according to Claim 5 characterised in that each rule is developed by refining 
a most general rule by at least one of: 

a) addition of a new condition to the condition set; 

b) unification of different variables to become constants or structured terms in 
condition. 

7. A method according to Claim 6 characterised in that a variable in a rule which is 
defined as being in constant mode and is numerical is at least partly evaluated by 
providing a range of values for the variable, estimating an accuracy for each value and 
selecting a value having optimum accuracy. 

8. A method according to Claim 7 characterised in that the range of values is a first
range with values which are relatively widely spaced, a single optimum accuracy value
is obtained for the variable, and the method includes selecting a second and relatively
narrowly spaced range of values in the optimum accuracy value's vicinity, estimating
an accuracy for each value in the second range and selecting a value in the second
range having optimum accuracy.

9. A method according to Claim 4 characterised in that it includes filtering to remove
duplicates of rules and equivalents of rules, i.e. rules having like but differently
ordered conditions compared to another rule, and rules which have conditions which
are symmetric compared to those of another rule.

10. A method according to Claim 4 or 9 characterised in that it includes filtering to remove
unnecessary 'less than or equal to' ("lteq") conditions.

11. A method according to Claim 10 characterised in that the unnecessary "lteq"
conditions are associated with at least one of ends of intervals, multiple lteq
predicates and equality condition and lteq duplication.

12. A method according to Claim 4 characterised in that it includes implementing an 
encoding length restriction to avoid overfitting noisy data by rejecting a rule refinement 
if the refinement encoding cost in number of bits exceeds a cost of encoding the 
positive examples covered by the refinement. 

13. A method according to Claim 4 characterised in that it includes stopping construction
of a rule if at least one of three stopping criteria is fulfilled as follows:

a) the number of conditions in any rule in a beam of rules being processed is
greater than or equal to a prearranged maximum rule length,

b) no negative examples are covered by a most significant rule, which is a rule
that:

i) is present in a beam currently being or having been processed,

ii) is significant,

iii) has obtained a highest likelihood ratio statistic value found so far, and

iv) has obtained an accuracy value greater than a most general rule
accuracy value, and

c) no refinements were produced which were eligible to enter the beam currently
being processed in a most recent refinement processing step (32).

14. A method according to Claim 13 characterised in that it includes adding the most 
significant rule to a list of derived rules and removing positive examples covered by 
the most significant rule from the training data set. 

15. A method according to Claim 4 characterised in that it includes:

a) selecting rules which have not met rule construction stopping criteria,

b) selecting a subset of refinements of the selected rules associated with
accuracy estimate scores higher than those of other refinements of the
selected rules, and

c) iterating a rule refinement, filtering and evaluation procedure (32 to 38) to 
identify any refined rule usable to test data. 

16. Computer apparatus for anomaly detection characterised in that it is programmed to
execute the steps of:-

a) developing a rule set of at least one anomaly characterisation rule from a
training data set and any available relevant background knowledge, a rule
covering a proportion of positive anomaly examples of data in the training data 
set, and 

b) applying the rule set to test data for anomaly detection therein. 

17. Computer software for use in anomaly detection characterised in that it incorporates
instructions for controlling computer apparatus to execute the steps of:-

a) developing a rule set of at least one anomaly characterisation rule from a
training data set and any available relevant background knowledge, a rule
covering a proportion of positive anomaly examples of data in the training data
set, and

b) applying the rule set to test data for anomaly detection therein.

ABSTRACT

A method of anomaly detection applicable to telecommunications or retail fraud uses
inductive logic programming to develop anomaly characterisation rules from relevant
background knowledge and a training data set, which includes positive anomaly samples of
data covered by rules. Data samples include 1 or 0 indicating association or otherwise with
anomalies. An anomaly is detected by a rule having a condition set which the anomaly fulfils.
Rules are developed by addition of conditions and unification of variables, and are filtered to
remove duplicates, equivalents, symmetric rules and unnecessary conditions. Overfitting of
noisy data is avoided by an encoding cost criterion. Termination of rule construction involves
criteria of rule length, absence of negative examples, rule significance and accuracy, and
absence of recent refinement. Iteration of rule construction involves selecting rules with
unterminated construction, selecting rule refinements associated with high accuracies, and
iterating a rule refinement, filtering and evaluation procedure (32 to 38) to identify any
refined rule usable to test data.
